Amendments to the Specification:

A Substitute Specification is submitted herewith. A marked-up copy of the 
original specification containing the amendments made in the Substitute 
Specification is submitted herewith. It is submitted that the amendments to the 
Substitute Specification do not introduce new matter. 

Field of the Invention



[0001] The present invention relates to speech processing, and more 
particularly to a voicing determination of the speech signal having a particular, but 
not exclusive, application to the field of mobile telephones. 



[0002] In known speech codecs the most common phonetic classification is a
voicing decision, which classifies a speech frame as voiced or unvoiced.
Generally speaking, voiced segments are associated with high local
energy and exhibit a distinct periodicity corresponding to the fundamental
frequency, or equivalently the pitch, of the speech signal, whereas unvoiced
segments resemble noise. However, a speech signal also contains segments
that can be classified as a mixture of voiced and unvoiced speech, where both
components are present simultaneously. This category includes voiced fricatives
and breathy and creaky voices. The appropriate classification of mixed segments
as either voiced or unvoiced depends on the properties of the speech codec.

[0003] In a typical known analysis-by-synthesis (A-b-S) based speech codec, 
the periodicity of speech is modelled with a pitch predictor filter, also referred to as 
a long-term prediction (LTP) filter. It characterizes the harmonic structure of the 
spectrum based on the similarity of adjacent pitch periods in a speech signal. The 
most common method used for pitch extraction is the autocorrelation analysis, 
which indicates the similarity between the present and delayed speech segments. 
In this approach the lag value corresponding to the major peak of the 
autocorrelation function is interpreted as the pitch period. It is typical that for 
voiced speech segments with a clear pitch period the voicing determination is 
closely related to pitch extraction. 

SUMMARY OF THE INVENTION 

[0004] According to a first aspect of the present invention there is provided a 
method for determining the voicing of a speech signal segment, comprising the 
steps of: dividing a speech signal segment into sub-segments, determining a 
value relating to the voicing of respective speech signal sub-segments, comparing 
said values with a predetermined threshold, and making a decision on the voicing 
of the speech segment based on the number of the values on one side of the 
threshold. 

[0005] According to a second aspect of the present invention there is provided
a device for determining the voicing of a speech signal segment, comprising
means (106) for dividing a speech signal segment into sub-segments, means
(110) for determining a value relating to the voicing of respective speech signal
sub-segments, means (112) for comparing said values with a predetermined
threshold and means (112) for making a decision on the voicing of the speech
segment based on the number of the values on one side of the threshold.

[0006] The invention provides a method for voicing determination to be used 
particularly, but not exclusively, in a narrow-band speech coding system. The 
invention addresses the problems of prior art by determining the voicing of the 
speech segment based on the periodicity of its sub-segments. The embodiments 
of the present invention give an improvement in the operation in a situation where 
the properties of the speech signal vary rapidly such that the single parameter set 
computed over a long window does not provide a reliable basis for voicing 
determination. 

[0007] A preferred embodiment of the voicing determination of the present 
invention divides a segment of speech signal further into sub-segments. Typically 
the speech signal segment comprises one speech frame. Furthermore, it may 
optionally include a possible lookahead which is a certain portion of the speech 
signal from the next speech frame. A normalized autocorrelation is computed for 
each sub-segment. The normalized autocorrelation values of the sub-segments 
are forwarded to classification logic, which compares them with the
predefined threshold value. In this embodiment, if a certain percentage of 
normalized autocorrelation values exceeds a threshold, the segment is classified 
as voiced. 

[0008] In one embodiment of the present invention, a normalized 
autocorrelation is computed for each sub-segment using a window whose length 
is proportional to the estimated pitch period. This ensures that a suitable number
of pitch periods is included in the window.

[0009] In addition to the above, a critical design problem in voicing
determination algorithms is the correct classification of transient frames. This is
especially true of transients from unvoiced to voiced speech, as the energy of the
speech signal is usually growing. If no separate algorithm is designed for
classifying the transient frames, the voicing determination algorithm is always a
compromise between the misclassification rate and the sensitivity to detecting
transient frames appropriately.

[0010] To improve the performance of the voicing determination algorithm 
during transient frames without increasing the misclassification rate practically at 
all, one embodiment of the present invention provides rules for classifying the 
speech frame as voiced. This is done by emphasizing the voicing decisions of the 
last sub-segments in a frame to detect the transients from unvoiced to voiced 
speech. That is, in addition to having a certain number of sub-segments having a 
normalized autocorrelation value exceeding a threshold value, the frame is
classified as voiced also if all of a predetermined number of the last sub-segments 
have a normalized autocorrelation value exceeding the same threshold value. 
Detection of unvoiced to voiced transients is thus further improved by 
emphasizing the last sub-segments in the classification logic. 

[0011] The frame may be classified as voiced if only the last sub-segment has 
a normalized autocorrelation value exceeding the threshold value. 

[0012] Alternatively, the frame may be classified as voiced if a portion of the
sub-segments out of the whole speech frame have a normalized autocorrelation
value exceeding the threshold. The portion may, for example, be substantially a
half, or substantially a third, of the sub-segments of the speech frame.

[0013] The voiced/unvoiced decision can be used for two purposes. One 
option is to allocate bits within the speech codec differently for voiced and 
unvoiced frames. In general, voiced speech segments are perceptually more 
important than unvoiced segments and thus it is especially important that a 
speech frame is correctly classified as voiced. In the case of A-b-S type of codec, 
this can be done for example by re-allocating bits from the adaptive codebook (for 
example from LTP-gain and LTP-lag parameters) to the excitation signal when the 
speech frame is classified as unvoiced to improve the coding of the excitation 
signal. On the other hand, the adaptive codebook in a speech codec can then
even be switched off during the unvoiced speech frame, which will lead to a reduced
total bit rate. Because of this on/off switching of LTP parameters it is especially
important that a speech frame is correctly classified as voiced. It has been noticed
that, if a voiced speech frame is incorrectly classified as unvoiced and the LTP
parameters are switched off, this leads to a decreased sound quality at the
receiving end. Accordingly, the present invention provides a method and device
for making a reliable voiced/unvoiced decision, especially so that
voiced speech frames are not incorrectly classified as unvoiced.

BRIEF DESCRIPTION OF THE DRAWINGS 
[0014] Exemplary embodiments of the invention are hereinafter described with
reference to the accompanying drawings, in which:

[0015] Fig. 1 shows a block diagram of an apparatus of the present invention; 
[0016] Fig. 2 shows a speech signal framing of the present invention; 
[0017] Fig. 3 shows a flow diagram in accordance with the present invention; 
and 

[0018] Fig. 4 shows a block diagram of a radiotelephone utilizing the invention. 

DETAILED DESCRIPTION OF THE DRAWINGS 
[0019] Fig. 1 shows a device 1 for voicing determination according to the first
embodiment of the present invention. The device comprises a microphone 101 for
receiving an acoustical signal 102, typically a voice signal, generated by a user,
and converting it into an analog electrical signal at line 103. An A/D converter 104
receives the analog electrical signal at line 103 and produces a digital electrical
signal y(t) of the user's voice at line 105. A segmentation block 106 then divides
the speech signal into predefined sub-segments at line 107. A frame of 20 ms (160
samples) can, for example, be divided into 4 sub-segments of 5 ms. After
segmentation a pitch extraction block 108 extracts the optimum open-loop pitch
period for each speech sub-segment. The optimum open-loop pitch is estimated
by minimizing the sum-squared error between the speech segment and its
delayed and gain-scaled version as follows:



$$J(t, \tau, g(t)) = \sum_{i=0}^{N-1} \bigl( y(t+i) - g(t)\, y(t+i-\tau) \bigr)^2 \tag{1}$$

where y(t) is the first speech sample belonging to the window of length N, τ is the
integer pitch period and g(t) is the gain.

[0020] The optimum value of g(t) is found by setting the partial derivative of the
cost function (1) with respect to the gain equal to zero. This yields

$$g(t) = \frac{R(t, \tau)}{R(t-\tau)} \tag{2}$$

where

$$R(t, \tau) = \sum_{i=0}^{N-1} y(t+i)\, y(t+i-\tau) \tag{3}$$

is the autocorrelation of y(t) with delay τ and

$$R(t) = \sum_{i=0}^{N-1} y^2(t+i) \tag{4}$$
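
The step from the cost function to the optimum gain can be written out explicitly. The following is a short worked derivation, assuming the sums run over the N samples of the window, with R(t, τ) denoting the cross term of equation (3) and R(t − τ) the energy of the delayed window:

```latex
\begin{align*}
\frac{\partial J}{\partial g(t)}
  &= -2 \sum_{i=0}^{N-1} \bigl( y(t+i) - g(t)\, y(t+i-\tau) \bigr)\, y(t+i-\tau) = 0 \\
\Rightarrow\quad
g(t) \sum_{i=0}^{N-1} y^2(t+i-\tau)
  &= \sum_{i=0}^{N-1} y(t+i)\, y(t+i-\tau) \\
\Rightarrow\quad
g(t) &= \frac{R(t,\tau)}{R(t-\tau)}
\end{align*}
```

The cost function is quadratic in g(t) with a non-negative leading coefficient, so this stationary point is a minimum whenever the delayed window has non-zero energy.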



[0021] By substituting the optimum gain into equation (1), the pitch period is
estimated by maximizing the latter term of

$$J(t, \tau, g(t)) = R(t) - \frac{R^2(t, \tau)}{R(t-\tau)} \tag{5}$$

with respect to delay τ. The pitch extraction block 108 is also arranged to send
the above determined open-loop pitch estimate τ at line 113 to the
segmentation block 106 and to a value determination block 110. An example of
the operation of the segmentation is shown in Fig. 2, which is described later.

[0022] The value determination block 110 also receives the speech signal y(t)
from the segmentation block 106 at line 107. The value determination block 110 is
arranged to operate as follows:

[0023] To eliminate the effects of the negative values of the autocorrelation
function when maximizing the function, a square root of the latter term of
equation (5) is taken. The term to be maximized is thus:

$$C_1(t, \tau) = \frac{R(t, \tau)}{\sqrt{R(t-\tau)}} \tag{6}$$
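
The open-loop search described in paragraphs [0021] to [0023] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the pitch extraction block 108: the function names, the lag range `tau_min` to `tau_max`, and the choice to skip windows with zero energy are hypothetical, and the measure is taken to be R(t, τ) divided by the square root of the energy of the delayed window.

```python
import math


def cross_corr(y, t, tau, n):
    """R(t, tau): sum of y[t+i] * y[t+i-tau] over a window of n samples."""
    return sum(y[t + i] * y[t + i - tau] for i in range(n))


def energy(y, t, n):
    """Energy of the n-sample window starting at index t."""
    return sum(y[t + i] ** 2 for i in range(n))


def open_loop_pitch(y, t, n, tau_min, tau_max):
    """Return the lag in [tau_min, tau_max] that maximizes C1(t, tau).

    Requires t - tau_max >= 0 and t + n <= len(y).
    """
    best_tau, best_c1 = tau_min, float("-inf")
    for tau in range(tau_min, tau_max + 1):
        e_delayed = energy(y, t - tau, n)
        if e_delayed <= 0.0:
            continue  # assumption: skip lags whose delayed window is silent
        c1 = cross_corr(y, t, tau, n) / math.sqrt(e_delayed)
        if c1 > best_c1:
            best_tau, best_c1 = tau, c1
    return best_tau
```

Because the signed value of R(t, τ) is kept in the numerator, lags at which the segment is anti-correlated with its delayed version score low instead of being rewarded, which is the effect the square root is said to achieve.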



[0024] During voiced segments, the gain g(t) tends to be near unity and thus it
is often used for voicing determination. However, during unvoiced and transient
regions, the gain g(t) fluctuates, also reaching values near unity. A more robust
voicing determination is achieved by observing the values of equation (6). To cope
with the power variations of the signal, R(t, τ) is normalized to have a maximum
value of unity, resulting in:

$$C_2(t, \tau) = \frac{R(t, \tau)}{\sqrt{R(t-\tau)\, R(t)}} \tag{7}$$
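
That the normalized measure cannot exceed unity follows from the Cauchy-Schwarz inequality. A short derivation, assuming the measure of equation (7) has the usual normalized-autocorrelation form with the energies of both windows under the square root:

```latex
\begin{align*}
R^2(t,\tau)
  &= \Bigl( \sum_{i=0}^{N-1} y(t+i)\, y(t+i-\tau) \Bigr)^2 \\
  &\le \Bigl( \sum_{i=0}^{N-1} y^2(t+i) \Bigr)
       \Bigl( \sum_{i=0}^{N-1} y^2(t+i-\tau) \Bigr)
   = R(t)\, R(t-\tau) \\
\Rightarrow\quad
|C_2(t,\tau)| &= \frac{|R(t,\tau)|}{\sqrt{R(t-\tau)\, R(t)}} \le 1
\end{align*}
```

Equality with a positive sign holds when the current window is a positively scaled copy of the delayed window, which is the ideal periodic case.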



[0025] According to one aspect of the invention, the window length in (7) is set
to the found pitch period τ plus some offset M to overcome the problems related to
a fixed-length window. The periodicity measure used is thus

$$C_2(t, \tau) = \frac{R_N(t, \tau)}{\sqrt{R_N(t-\tau)\, R_N(t)}} \tag{8}$$

where

$$R_N(t, \tau) = \sum_{i=0}^{\tau+M-1} y(t+i)\, y(t+i-\tau) \tag{9}$$

and

$$R_N(t) = R_N(t, 0) = \sum_{i=0}^{\tau+M-1} y^2(t+i) \tag{10}$$
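
The pitch-adaptive periodicity measure can be sketched in Python as follows. This is an illustrative sketch only: the function name `periodicity_measure` is hypothetical, the measure is assumed to be the cross-correlation divided by the square root of the product of the two window energies, and returning 0.0 for a silent window is a choice made here rather than something stated in the specification.

```python
import math


def periodicity_measure(y, t, tau, m=10):
    """Normalized autocorrelation over a window of tau + m samples at t.

    Requires t - tau >= 0 and t + tau + m <= len(y).
    """
    n = tau + m  # window length follows the pitch period plus offset M
    cross = sum(y[t + i] * y[t + i - tau] for i in range(n))
    e_now = sum(y[t + i] ** 2 for i in range(n))
    e_delayed = sum(y[t + i - tau] ** 2 for i in range(n))
    denom = math.sqrt(e_now * e_delayed)
    # Assumption: treat a silent window as having no periodicity.
    return cross / denom if denom > 0.0 else 0.0
```

For an exactly periodic input evaluated at its true period the value approaches 1, while at half the period of a sinusoid it approaches -1.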

[0026] The parameter M can be set, e.g., to 10 samples. A voicing decision
block 112 receives the above determined periodicity measure C2(t, τ) at
line 111 from the value determination block 110, together with parameters K, Ktr
and Ctr, to make the voicing decision. The decision logic of the voiced/unvoiced
decision is further described in connection with Fig. 3 below.

[0027] It should be emphasized that the pitch period used in (8) can also be
estimated in other ways than described in equations (1) - (6) above. A common
modification is to use pitch tracking in order to avoid pitch multiples, as described in
Finnish patent application FI 971976. Another optional function for the open-loop
pitch extraction is that the effect of the formant frequencies is removed from the 
speech signal before pitch extraction. This can be done for example by a 
weighting filter. 

[0028] Modified signals, for example a residual signal, weighted residual signal
or weighted speech signal, can also be used for voicing determination instead of
the original speech signal. The residual signal is obtained by filtering the original
speech signal by a linear prediction analysis filter.

[0029] It may also be advantageous to estimate the pitch period from the 
residual signal of the linear prediction filter instead of the speech signal, because 
the residual signal is often more clearly periodic. 

[0030] The residual signal can be further low-pass filtered and down-sampled 
before the above procedure. Down-sampling reduces the complexity of correlation 
computation. In one further example, the speech signal is first filtered by a 
weighting filter before the calculation of autocorrelation is applied as described 
above. 

[0031] Fig. 2 shows an example of dividing a speech frame into four
sub-segments whose starting positions are t1, t2, t3 and t4. The window lengths N1,
N2, N3 and N4 are proportional to the pitch period found as described above. The
lookahead is also utilized in the segmentation. In this example, the number of
sub-segments is fixed. Alternatively, the number of sub-segments can be variable, based
on the pitch period. This can be done for example by selecting the sub-segments
by t2 = t1 + τ + L, t3 = t2 + τ + L, etc. until all available data is utilized. In this
example L is constant and can be set, e.g., to -10, resulting in overlapping
sub-segments.
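
The variable-count segmentation rule can be sketched in Python as follows. This is a hedged sketch, not the segmentation block 106 itself: the function name, the use of a single pitch value τ for the whole frame, the window length τ + M, and the stopping rule (a start is kept only while its window fits inside the available data) are all assumptions made for illustration.

```python
def sub_segment_starts(t1, available_end, tau, m=10, step_offset=-10):
    """Start positions t1, t2, ... with t[k+1] = t[k] + tau + L.

    step_offset plays the role of the constant L. A start position is kept
    only while its analysis window of tau + m samples ends at or before
    available_end (frame plus lookahead).
    """
    step = tau + step_offset
    if step <= 0:
        raise ValueError("tau + L must be positive")
    starts = []
    t = t1
    while t + tau + m <= available_end:
        starts.append(t)
        t += step
    return starts
```

With τ = 40, M = 10 and L = -10 the windows are 50 samples long and start every 30 samples, so consecutive sub-segments overlap by 20 samples.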

[0032] Fig. 3 shows a flow diagram of the method according to one
embodiment of the present invention. The procedure is started by step 301 where
the open-loop pitch period τ is extracted as exemplified above in equations (1) -
(6). At step 302 C2(t, τ) is calculated for each sub-segment of the speech as
described in equation (8). Next, at step 303, the number n of sub-segments
where C2(t, τ) is above a certain first threshold value Ctr is calculated. The
comparator 304 determines whether the number of sub-segments n, determined
at step 303, exceeds a certain second threshold value K. If the second threshold
value K is exceeded the speech frame is classified as voiced. Otherwise the
procedure continues to step 305. In this embodiment, at step 305 the comparator
determines if a certain number Ktr of the last sub-segments have a value C2(t, τ)
exceeding the threshold Ctr. If so, the speech frame is
classified as a voiced frame. Otherwise the speech frame is classified as an unvoiced
frame.

[0033] The exact parameter values Ctr, Ktr and K presented above are not 
limited to certain values but are dependent on the system specified and can be 
selected empirically using a large speech database. For example, if the speech 
segment is divided into 9 sub-segments, suitable values can be for example 
Ctr = 0.6, Ktr = 4 and K = 6. An appropriate value of K and Ktr is proportional to the
number of sub-segments.
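
The decision flow of Fig. 3 with the example parameters above can be sketched in Python as follows. This is a minimal sketch rather than the logic of the voicing decision block 112 itself: the function name `is_voiced` is hypothetical, and "exceeds" is read here as a strict comparison, which is an interpretation.

```python
def is_voiced(c2_values, c_tr=0.6, k=6, k_tr=4):
    """Classify a frame from the periodicity values of its sub-segments.

    c2_values holds one value per sub-segment, in time order.
    """
    above = [c > c_tr for c in c2_values]
    # Steps 303-304: count sub-segments above C_tr and compare with K.
    if sum(above) > k:
        return True
    # Step 305: voiced also if all of the last K_tr sub-segments are above C_tr.
    return k_tr > 0 and len(above) >= k_tr and all(above[-k_tr:])
```

For nine sub-segments, `is_voiced([0.1] * 5 + [0.9] * 4)` returns True through the second rule even though only four values exceed the threshold, which is how a frame containing an unvoiced to voiced transient is caught.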

[0034] Alternatively, according to the present invention, the frame is classified
as voiced if only the last sub-segment (i.e. Ktr = 1) has a normalized 
autocorrelation value exceeding the threshold value. According to still one 
modification the frame is classified as voiced if substantially half of the sub- 
segments out of the whole speech frame (e.g. 4 or 5 sub-segments out of 9) have 
a normalized autocorrelation value exceeding the threshold. 

[0035] Fig. 4 is a block figure of a radiotelephone including the parts of the 
present invention. The radiotelephone comprises a microphone 61, keypad 62,
display 63, speaker 64 and antenna 71 with switch for duplex operation. Further
included is a control unit 65, implemented for example in an ASIC circuit, for
controlling the operation of the radiotelephone. Fig. 4 also shows the transmission
and reception blocks 67, 68 including speech encoder and decoder blocks 69, 70.
The device for voicing determination 1 is preferably included within the speech
encoder 69. Alternatively the voicing determination can be implemented
separately, not within the speech encoder 69. The speech encoder/decoder
blocks 69, 70 and the voicing determination 1 can be implemented by a DSP 
circuit including known elements such as internal/external memories and 
registers, for implementing the present invention. The speech encoder/decoder 
can be based on any standard/technology and the present invention thus forms 
one part for the operation of such codec. The radiotelephone itself can operate in 
any existing or future telecommunication standard based on digital technology. 

[0036] To improve the performance of the voicing determination algorithm,
specifically in unvoiced to voiced transients, the last sub-segments are
emphasized: the frame is also classified as voiced if all of a predetermined
number of the last sub-segments have a normalized autocorrelation value
exceeding the same threshold value.

[0037] In view of the foregoing description it will be evident to a person skilled
in the art that various modifications may be made within the scope of the present 
invention. 